ABSTRACT



To evaluate student learning in a computer-supported
environment known as "GenScope," a system was developed for assessing
students' understanding and learning of introductory genetics material
presented in two developed GenScope instruments. Both quantitative and 
qualitative methods were used to address traditional evidential validity 
concerns as well as more contemporary concerns with consequential and 
systemic validity. Findings from three GenScope implementation classrooms and 
interviews with two teachers and five secondary school students show strong 
evidential validity, but only limited consequential validity. In response to 
these findings, a set of curricular activities was developed to scaffold 
student assessment performance without compromising the evidential validity 
of the assessment system. The study shows the usefulness of newer 
interpretive models of validity inquiry and the value of multifaceted Rasch 
measurement tools for conducting such inquiry. Two appendixes contain sample 
items from one assessment and a sample GenScope investigation. (Contains 3 
tables, 5 figures, and 23 references.) (SLD) 






TM029301 ED 426 077 



Evidential and Systemic Validity 



Assessing Learning in a Technology-Supported Genetics Environment: Evidential and
Systemic Validity Issues






Daniel T. Hickey, Georgia State University 
Edward W. Wolfe, University of Florida 
Ann C. H. Kindfield, Montclair State University




April, 1998 






Presented at the Annual Meeting of the American Educational Research Association, 

San Diego, CA. 



Author Notes 

This research was partly supported by a Postdoctoral Fellowship from Educational Testing 
Service and was initiated when all three authors were affiliated with the Center for Performance 
Assessment at ETS. The GenScope Assessment Project is funded by the National Science 
Foundation (Project # RED-95-53438), via The Concord Consortium. We gratefully 
acknowledge the input of colleagues, including Joan Heller, Carol Myford, Drew Gitomer,
Bob Mislevy, and Iris Tabak at ETS; Paul Horwitz, Mary Ann Christie, and Joyce Schwartz at
the Concord Consortium; and Alex Heidenberg at Georgia State University. Thanks also to the
teachers and students at Lincoln-Sudbury High and Boston High. For information, contact
Daniel T. Hickey, Dept EPSE, Georgia State University, Atlanta, GA, 30303, or
dhickey@gsu.edu.







Abstract 

In order to evaluate student learning in a computer-supported environment known as GenScope,
we developed a system for assessing students' understanding and learning of introductory
genetics. A critical aspect of the development effort concerned the validity of this assessment
system. We used quantitative and qualitative methods to address traditional evidential validity
concerns as well as more contemporary concerns with consequential and systemic validity.
Specifically, we examined whether or not our assessment system helped students develop the
understanding it was designed to assess. Our inquiry revealed strong evidential validity, but only
limited consequential validity. In response, we developed a set of curricular activities designed to
scaffold student assessment performance without compromising the evidential validity of the
assessment system. In addition to documenting and enhancing the system's validity, these
efforts demonstrate the utility of newer interpretive models of validity inquiry and the value of
multifaceted Rasch measurement tools for conducting such inquiry.

Assessing Learning in a Technology-Supported Genetics Environment:
Evidential and Systemic Validity Issues



The work described here was a result of our participation in a multi-year implementation 
and evaluation effort involving a computer-supported learning environment known as GenScope 
(Horwitz, Neumann, & Schwarz, 1996; see http://GenScope.concord.org). GenScope was 
designed primarily for teaching introductory genetics in secondary biology classrooms. As
illustrated in Figure 1, the GenScope software employs fanciful species such as dragons as well
as real species, and lets students observe and manipulate the dynamic relationships across the
various levels of biological organization. In key respects, this application of educational
computing is consistent with recent policy recommendations for K-12 educational technology
issued by the President's Committee of Advisors on Science and Technology (PCAST, 1997).
Specifically, the software and associated curriculum were designed to help students develop the
kind of higher-level domain reasoning skills called for by current science education standards
(e.g., National Research Council, 1996) and to embody contemporary constructivist pedagogical
principles.

A key challenge in our research effort was developing an assessment system for 
documenting the degree to which students could demonstrate the kinds of domain reasoning that 
GenScope ostensibly affords. We needed an assessment system that was consistent with the 
pedagogical assumptions embodied in GenScope while also affording the sort of rigorous
evaluation of learning outcomes that is also called for in current policy recommendations (i.e.,
PCAST, 1997). A major part of this challenge, and the focus of this paper, concerns the
validity of this assessment system. This paper describes the assessment system that we
developed and our inquiry into its validity. This inquiry used interpretive and empirical methods
to address traditional evidential validity concerns as well as more contemporary concerns with
consequential validity (Messick, 1989) and systemic validity (Frederiksen & Collins, 1989). In
particular, our inquiry considered whether the assessment system further contributed to student
learning, and whether or not it had done so at the expense of the system's evidential validity.

Validity Inquiry as Argumentation

Assuming that psychological conclusions (including ones about validity) are at some
level undetermined (i.e., subject to multiple interpretations), one must marshal evidence in favor
of one's own reasoned interpretation and against alternative interpretations. A prerequisite to
scientific argumentation is setting forth the assumptions and guiding conceptualizations of the
world in which the argument takes place. Overarching assumptions, in the form of "world
views," provide communities of scientists with the shared lore of what "counts" as evidence,
specifically what constitutes legitimate research questions, acceptable experiments to test those
questions, and legitimate data from those experiments. Following are the assumptions about
knowing and learning, transfer, assessment, and validity that guided our research and frame the
arguments that warrant our conclusions.

Assumptions about Knowing and Learning 

Socio-constructivist/situative epistemological perspectives such as situated cognition
(e.g., Brown, Collins, & Duguid, 1989) hold that knowledge and skills are fundamentally
contextualized (i.e., "situated") in the physical and social context in which they are acquired and
used. Skills and knowledge are conceptualized as being distributed across the social and
physical environment, jointly composed in a system that comprises an individual and peers,
teachers, and culturally provided tools. From this perspective, complex cognitive performances
usually require external tools, such as pencil, paper, computers, books, peers, and teachers.
Furthermore, people with less education and skill rely more on these tools for complex thinking
than their more proficient counterparts. Rather than something that can be "possessed,"
proficiency in a domain is seen in part as knowing how to overcome the limits of mind and
memory.

Technology-based tools such as GenScope make complex relationships and interactions
in particular domains visible and manipulable, allowing students to test their ideas and
understanding. Akin to the Cuisenaire rods that have dramatically reshaped primary mathematics
instruction, the various windows in GenScope provide manipulable representations of a
simplified genetic system of imaginary and real organisms. Because the representations are
dynamically linked, students can control the abstract processes and observe the components
across the various "levels" of biological organization where genetics is manifested (i.e., DNA,
chromosomal, cellular, Mendelian, and evolutionary). From traditional empiricist/associationist
or pragmatist/rationalist epistemological perspectives, GenScope would be seen as a new tool to
teach genetics via demonstrations and routine exercises, or as a discovery learning environment
in which students can "discover" important domain concepts. In contrast, contemporary
constructivist perspectives view GenScope as a tool that affords a structured environment where
learners collaboratively experience, learn, and demonstrate a more sophisticated understanding
of complex relationships and phenomena than they could otherwise. From this perspective,
students initially "understand" the domain represented by a simulation as they internalize the
language, representations, and relations in that environment. This internalization happens both
as students interact with the environment, and as students interact with each other within that
environment. Consider, for example, two students using GenScope to solve complex inheritance
problems (e.g., sex linkage or crossover) while struggling with fragile understandings of the
underlying concepts such as chromosome type and meiotic events. The common representation
afforded by their shared understanding of the simulation environment supports a more
sophisticated level of interaction than would be possible without such a tool. Meanwhile, the
associated curricular activities and the teacher help these students connect their shared activity to
the broader domain (textbook depictions of genetics, other biological domains, other sciences,
etc.). As individuals internalize the shared understanding of concepts and phenomena that are
"stretched across" this environment, they move closer and closer to the goal of "expert"
understanding in the domain. This exemplifies precisely how contemporary instructional
theorists believe that software tools can facilitate learning by extending and expanding what
Vygotsky (1978) characterized as "the zone of proximal development."

Assumptions about Transfer 

A key aspect of any interpretation of assessment performance is whether it demonstrates
transfer of knowledge from the learning situation represented by a particular learning
environment. In the typical absence of a known transfer situation (such as an employment
setting or a subsequent course), the actual transfer situation is unknown, and the assessments
themselves are the transfer situation. Thus, the validity of one's interpretation of student
performance is partly contingent on the appropriateness of the assessments as criteria of
performance itself, or as surrogates for some other unspecified transfer setting.

From a situated cognition perspective, transfer is considered in terms of the constraints
and affordances that support activity in the learning situation and in the transfer situation
(Greeno, Collins, & Resnick, 1996). Analyzing transfer involves analyzing the
"transformations" that relate a given pair of learning and transfer situations. For any transfer,
some constraints and affordances must be the same (be "invariant") across both situations. If
transfer is to take place, the learner must learn (become "attuned" to) these invariants in the
initial learning situation. In order to interpret the degree of transfer represented by performance
on assessments, one must first identify the dimensions that vary between the learning situation
(i.e., the GenScope environment or a comparison genetics learning environment) and the transfer
situation (i.e., our various assessment tasks). For example, one dimension of transfer in our
research concerned the way the organism's genotype was represented. As illustrated in Figure 1,
GenScope's chromosome window provides a colorful depiction of the organisms' various allelic
combinations (i.e., AA vs. Aa vs. aa) that is dynamically linked to other representations; in
contrast, our assessments used the traditional "stick figure" representation of the organism's
genome. If students' understanding of genotypic representation is to transfer from the GenScope
environment to the assessment environment, they must be able to distinguish between the aspects
of the representation that are particular to GenScope and the aspects that are invariant (i.e., the
domain-relevant information that is conveyed by both representations).

Assumptions about Assessment

Traditional assessment approaches that tested whether or not an individual "possessed"
proficiency were premised on two key assumptions: that knowledge can be decomposed into
elements, and that knowledge can be decontextualized so that it can exist or be
measured free of context. The perspectives on knowing and learning described above have led
many contemporary theorists to reject both assumptions, offering critical implications for
assessment:

Any individual has a range of knowledge and competencies, rather than some fixed level
of performance. Depending on how much support and familiarity with the materials at
hand she or he has, an individual's performance will be greatly affected. It may be just as
crucial to measure the quality of that supported performance--or the gap between solo
and supported thought (Wolfe, Bixby, Glenn, and Gardner, 1991, p. 51).

Many of the conflicts that emerge when developing assessment frameworks are rooted in 
conflicting assumptions about knowing and learning, and the implications of those assumptions 
for our assumptions about proficiency and transfer. For example, a critical issue concerns the 
difficulty of the assessments relative to the students' abilities. Learning environments such as 
GenScope are designed to focus on "higher-order" domain-specific understanding, rather than 
mere memorization of terms and understanding of simple concepts. Higher-order thinking is 
generally characterized as non-algorithmic, complex, and effortful, and as involving multiple 
solutions, nuanced judgment, the application of multiple criteria, and self-regulation. In our 
development effort, we found that the problems we first designed to assess what we defined as 
higher-order domain understanding were exceedingly difficult for participants in the initial pilot 
implementation of the learning environment. There are arguments both for and against targeting 
a level of performance that few students are likely to attain. Current perspectives on assessment 
suggest that a more fundamental question concerns whether or not we have identified an 
appropriate performance criterion, or whether or not it is appropriate to even select a criterion. 

As illustrated by Wolfe et al. above, a central theme of new assessment perspectives is that
assessments should maximize student performance as much as possible, starting at whatever
level of performance students are capable of. This is particularly the case if an intended
consequence of assessment is increasing student understanding. For example, Wiggins argues:

To make tests truly enabling we must do more than just couch test tasks in more authentic
performance contexts. We must construct tests that assess whether students are learning
how to learn, given what they know. Instead of testing whether students have learned
how to read, we should test their ability to read to learn, instead of finding out whether
they "know" formulas, we should find out whether they can use formulas to find other
formulas (1993, p. 214, emphasis added).

These perspectives can be seen as arguing against specifying an "adequate" criterion level of
performance. At a minimum, they suggest that if a criterion is used, it should
concern the degree of support needed for students to perform at an acceptable level, rather than a
level of performance to be achieved without support. Interpretations of student performance can
then be made in terms of the type and degree of support needed to solve problems that require
"higher-order understanding." Thus, the range of proficiency might start with a highly
scaffolded problem, with increasingly higher levels of proficiency indicated by solving
similar problems after stripping away more and more layers of support. As we will show, this
perspective (and our initial experience with unscaffolded higher-level problems) led us to design
assessments where the easier activities that students first encounter scaffold their performance on
the more difficult ones that appear later in the test.

Assumptions about Validity 

In one oft-cited characterization, Messick defines validity as "an integrated evaluative
judgment of the degree to which empirical evidence and theoretical rationales support the
adequacy and appropriateness of inferences and actions based on test scores or other modes of
assessment" (1989, p. 5). Messick (e.g., 1989, 1995) discusses validity in terms of the four
"facets" derived from crossing the distinction between interpreting and using test scores, and the
distinction between the evidential basis of validity and the consequential basis of validity (shown
in Table 1). Facet 1 is best understood as the search for construct-irrelevant variance. For
example, do students perform better or worse on an assessment for reasons other than individual
differences in the underlying targeted construct? Within this explicitly additive framework,
Facet 2 adds to Facet 1 the need for evidence that supports the relevance of a given score
interpretation in a particular applied setting. For example, is one's presumably valid
interpretation of student understanding itself a valid means of assessing the particular learning
environment?

The inclusion of the consequences of test use and test score interpretation represents a
recent, somewhat controversial advance in validity theory. The presumed desirable and
undesirable consequences of various assessment practices have provided much of the support for
newer performance assessment methods and assessment-oriented educational reform efforts.
Facet 3 of Table 1 concerns the intended and unintended consequences of interpreting student
performance, "the appraisal of value implications of score meaning, including value implications
of the construct label itself, of the broader theory that undergirds construct meaning, and of the
still broader ideologies that give theories their purpose and meaning..." (Messick, 1995, p. 748).
For example, does the way proficiency is interpreted have important consequences for particular
groups of students? Facet 4 concerns the intended and unintended social consequences of using
an assessment system. For example, did a given assessment practice have the desired effect of
leading the students, teacher, and curriculum developers to focus more on higher-level
understanding? Did completing the assessment lead them to integrate and organize what they
already knew or did it lead them to doubt and question what they had learned?

Messick (1994, 1995) and others (e.g., Linn, Baker, & Dunbar, 1991) argue that it is
particularly important to study the consequences of performance-based assessments because they 
promise positive consequences for learners and learning. These consequences are often cited to 
justify the added expense of performance-based assessment and to justify potential compromises 
to evidential validity. Messick further points out that it is wise to look for both the actual and 
potential consequences. Anticipation of likely outcomes helps us find side effects and determine 
what kinds of evidence are needed to monitor them. Such anticipation "may alert one to take 
timely steps to capitalize on positive effects and ameliorate or forestall negative effects" 
(Messick, 1995, p. 774). 

Advancing a validity perspective that follows from the assumptions on learning and 
assessment embraced in our work (and outlined above), Frederiksen and Collins (1989) further 
emphasized the consequences of assessment practices by introducing the notion of systemic 
validity: 

A systemically valid test is one that induces in the educational system curricular and 
instructional changes that foster the development of the cognitive skills that the test is 
designed to measure. Evidence for systemic validity would be an improvement in those 
skills after the test has been in place within the educational system for a period of time
(p. 27).

Frederiksen and Collins propose a set of principles for the design of systemically valid 
assessment systems, including the components of the system (a representative set of tasks, a 
definition of the primary traits for each subprocess, a library of exemplars, and a training system 
for scoring tests), standards for judging the assessments (directness, scope, reliability, and 



0 

ERIC 



8 



10 



Evidential and Systemic Validity 



transparency) and methods for fostering self-improvement (practice in self-assessment, repeated 
testing, performance feedback, and multiple levels of success). 

In our opinion, Frederiksen and Collins (1989) advance a lofty but worthy benchmark
for evaluating assessment practice. Furthermore, documenting a measure of systemic validity
will address the important concerns that have been raised by the assessment community
regarding the potential negative consequences of short-answer paper and pencil assessment
measures. Like many others, our assessment effort was constrained to such a format. We agree
with Stiggins (1994) and others that, despite their limitations, paper and pencil assessments can
be thoughtfully used to assess some aspects of higher-order domain-specific understanding and to
further develop that understanding.

In developing a framework for our validity inquiry, we found Moss' (1993) and Shepard's
(1993, 1996) criticism of Messick's (1989) perspective invaluable. We initially struggled with
the distinction between Messick's different facets and with what Messick describes as the
"progressive" nature of the framework, where construct validity appears in every cell, with
something more added in each subsequent cell. Shepard (1993, p. 427) argues that Messick's
faceted presentation implies that "values are distinct from a scientific evaluation of test score
meaning" and "implicitly equates construct validity with a narrow definition of score meaning."
Furthermore, the sequential segmentation of validity "gives researchers tacit permission to leave
out the very issues which Messick has highlighted because the categories of use and
consequences appear to be tacked on to 'scientific' validity which remains sequestered in the first
cell." In our case, Messick would have us first evaluate the validity of our interpretation of
student scores on our assessments (i.e., whether students perform poorly or well for reasons other
than what we anticipated) before considering the consequences of our interpretation. Clearly,
though, the validity of our interpretation of performance is strongly affected by the
consequences of that interpretation. If students do not even try to finish a test because they are
not being graded (i.e., minimal consequences of test use), then scores are an invalid depiction of
student understanding a priori.

While Shepard agrees with Messick about the scope and range of validity inquiry, her
differences with Messick's presentation have important implications for the way such inquiry is
carried out. Shepard argues that Messick's framework does not help identify which validity
questions are essential to support a test's use. This concern seems particularly relevant given the
typically limited resources available for validity inquiry and the difficulty in prioritizing validity
research questions. Indeed, we initially intended to focus only on "construct validity" because of
the complexities of studying consequences of the assessments.

As an alternative to Messick's conceptualization, Shepard (1993, p. 428) equates
construct validity "with the full set of demands implied by all four cells, which all involve score
meaning." In light of the dilemma described above, Shepard insists that "intended effects
entertained in the last cell are integrally part of test meaning in applied contexts." In order to
provide "a more straightforward means to prioritize validity questions," Shepard suggests that

validity evaluations be organized in response to the question "What does the testing
practice claim to do?" Additional questions are implied: What are the arguments for and
against the intended aims of the test? and What does the test do in the system other than
what it claims, for good or bad? All of Messick's issues should be sorted through at once,
with consequences as equal contenders alongside domain representativeness as
candidates for what must be assessed in order to defend test use (1993, pp. 429-430).

This view of validity inquiry draws strongly from Cronbach's (1988, 1989) concept of validation
as evaluation argument, which in turn draws strongly from insights in program evaluation
regarding the nature of evidence and argumentation, the posing of contending validity questions,
and the responsibility to consider all of the potential audiences affected by a program. Cronbach
has pointed out that program evaluators do not have the luxury of setting aside issues in the way
that basic researchers typically do. The limited time and resources typically available to program
evaluators force them to identify the most relevant questions, and assign priorities depending on
issues such as prior uncertainty, information yield, cost, and the importance of the questions for
achieving consensus in the relevant audience.

Kane's (1992) extension of Cronbach's approach conceptualizes validation as the 
evaluation of interpretive argument: 

To validate a test score interpretation [including test uses] is to support the plausibility of
the corresponding interpretive argument with appropriate evidence. The argument-based
approach to validation adopts the interpretive argument as the framework for collecting
and presenting validity evidence and seeks to provide convincing evidence for its
inferences and assumptions, especially its most questionable assumptions. (1992, p. 527)

Drawing from literature on practical reasoning and evaluation argument, Kane identifies the
criteria for interpretive argument as the following: (a) the argument must be clearly stated so that
what is claimed is known; (b) the argument must be coherent in the sense that conclusions follow
reasonably from the assumptions; and (c) assumptions should be plausible or supported by
evidence, which includes investigating plausible counterarguments.

In summary, the interpretive argument approach described by Kane, along with
Shepard's characterization of prioritizing one's inquiry, allows us to examine the validity of our
assessment system in accordance with the assumptions on knowledge, transfer, and assessment
described above. Following is a description of the assessment system we developed and the way
we conducted this inquiry.



Method 

Assessment System 

The larger research agenda dictated several typical constraints for our assessment system.
It needed to be paper-and-pencil, easy to score and interpret, and appropriate and fair for use in
both implementation (i.e., GenScope) and comparison classrooms. Additionally, the assessment
system needed to satisfy both formative and summative assessment goals, capture the full range
of genetics reasoning within and between the various levels of biological organization, and be
consistent with current understanding of the development of reasoning in introductory genetics
(e.g., Stewart & Hafner, 1994; Kindfield, 1994).

Several design-implementation-revision cycles during the project's first year yielded two
instruments. Both instruments were designed around fabricated species with simplified genomes
consisting of three chromosomes and a handful of characteristics. The "NewWorm" was
intended for younger and/or academically at-risk students, whereas "NewFly" was intended for
older and/or college-bound students. As shown in the sample problems in Appendix A, the
NewWorm provided some explicit genotype/phenotype relationships to scaffold the most basic
understanding (i.e., the relationship is provided for the body-type characteristic but not for mouth
type). On the NewFly assessment, none of these relationships were provided for any of the
characteristics. While we used both of the assessments in the inquiry described here, we chose to
use only the NewWorm in our subsequent investigations because it captures a broader range of
expertise.

As shown in Table 2, we systematically varied the level of domain reasoning in our 
assessments along two dimensions. The type of reasoning assessed ranged from the simple 
cause-to-effect problems traditionally associated with secondary genetics instruction, to the more 
complex effect-to-cause problems that require the higher-level reasoning associated with domain 
expertise (and ostensibly afforded by GenScope). Both cause-to-effect and effect-to-cause 
reasoning were assessed in within-generation problems and in (more complex) between-generation
problems. As shown by the vertical axis of Table 2, items within the various problem
types ranged from the simple aspects of inheritance to the more complex aspects such as
sex-linkage.

The cause-to-effect between-generation problems (the classic Mendelian inheritance
problems that represent the typical extent of introductory genetics) varied on several additional
dimensions. We included both categorical (yes, maybe, no) and more difficult proportional (0,
1/8, 1/4, 1/2, etc.) reasoning, and both monohybrid and (more difficult) dihybrid inheritance.
Additionally, dihybrid inheritance included both unlinked and (more difficult) linked genes.
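
To make the distinction between categorical and proportional reasoning concrete, the following is a minimal illustrative sketch, not part of the assessment instruments. It assumes simple Mendelian inheritance of unlinked genes only (linked genes are not modeled), and the allele labels A/a and B/b, as well as the function names `gametes`, `cross`, and `categorical`, are hypothetical choices made for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def gametes(genotype):
    """Return each possible gamete with its probability.

    `genotype` is a list of allele pairs, one pair per unlinked gene,
    e.g. ["Aa", "Bb"]. A gamete receives one allele from every pair.
    """
    counts = Counter("".join(alleles) for alleles in product(*genotype))
    total = sum(counts.values())
    return {g: Fraction(n, total) for g, n in counts.items()}


def cross(parent1, parent2):
    """Return offspring genotype probabilities (assumes unlinked genes)."""
    offspring = Counter()
    for (g1, p1), (g2, p2) in product(gametes(parent1).items(),
                                      gametes(parent2).items()):
        # Sort each allele pair so that "aA" and "Aa" count as one genotype.
        genotype = tuple("".join(sorted(pair)) for pair in zip(g1, g2))
        offspring[genotype] += p1 * p2
    return dict(offspring)


def categorical(probability):
    """Map a proportion onto the categorical scale (yes, maybe, no)."""
    if probability == 0:
        return "no"
    if probability == 1:
        return "yes"
    return "maybe"


# Monohybrid cross, proportional answer: Aa x Aa gives aa offspring 1/4 of the time.
print(cross(["Aa"], ["Aa"])[("aa",)])

# Dihybrid cross with unlinked genes: AaBb x Aabb gives aabb offspring 1/8 of the time.
print(cross(["Aa", "Bb"], ["Aa", "bb"])[("aa", "bb")])

# Categorical answer: can AA x aa produce an aa offspring?
print(categorical(cross(["AA"], ["aa"]).get(("aa",), 0)))
```

The categorical scale collapses every proportion strictly between 0 and 1 into "maybe," which is one way to see why the proportional items demand more of the student than the categorical ones.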

In keeping with the contemporary assessment perspectives outlined above, our
assessments were designed to scaffold student problem solving across the increasingly difficult
items. Specifically, we expected that solving the simpler initial problems would leave students
with the understanding (e.g., of the organism, our representational scheme, etc.) and self-confidence
needed to solve the much more difficult problems later on.

Inquiry Framework 

Our validity inquiry was organized around Messick's (1995) "six distinguishable aspects 
of construct validity." Table 3 provides a detailed description of the six and a list of the validity 
issues associated with each. Following the interpretive argument approach to validity inquiry 
advanced by Kane (1992) and Shepard (1993), we first defined the arguments that we anticipated 
making with or about our assessment system, and then exhaustively considered the potential 
threats to the validity of those arguments. Research priorities were established by weighing our 
concern over the particular threat against the resources needed to investigate it. The nature of our 
inquiry ranged from incidental to explicit. The middle column of Table 3 summarizes inquiry 
methods for each aspect, and additional methodological details follow. 

Readers should note that the segmented presentation does not imply the existence of six 
different types of validity. Like Messick’s characterization, our inquiry reflects a unified 
concept of validity. As such, validity can neither rely on nor require any one form of evidence, 
and some forms of evidence may be forgone for other forms of evidence: “What is required is a 
compelling argument that the available evidence justifies the test interpretation and use” 
(Messick, 1995, p. 744). Our investigation bears out Messick’s argument that the distinction 
between the six aspects provides “a means of addressing functional aspects of validity that help 
disentangle some of the complexities inherent in appraising the appropriateness, meaningfulness, 
and usefulness of score inferences” (1995, p. 744). 

Content-related inquiry. The “content relevance, representativeness, and technical 
quality” (Messick, 1995, p. 745) of our assessment system was implicitly supported by having a 
nationally recognized content expert (the third author) lead the assessment team, and via routine 
feedback from teachers and content experts on the development/implementation team. Once the 
assessments were developed, content was explicitly validated via review by outside content 
experts (both university-based science education researchers with extensive secondary biology 
teaching experience) and by comparing the assessments to the biology content standards 
published by the National Research Council (1996). 

In a somewhat novel aspect of our inquiry, we developed and validated a framework for 
documenting the degree of transfer represented by particular assessment performances. This 
framework was used to consider whether the curriculum activities and the classroom teaching 
practices corrupted the assessment activities (i.e., by reducing complex problem-solving 
activities into simple algorithmic or pattern-recognition exercises). First, we documented the 
number and nature of transformations between the assessment environment and the GenScope 
environment (and other likely comparison genetics learning environments). We further validated 
our assumptions about the learning environment by observing selected sessions in GenScope 
classrooms and interviewing the teacher. When paired with the results from the substantive 
inquiry (described below), this inquiry yielded a detailed framework for considering the degree 
of transfer represented by particular levels of assessment performance. 

Structural, external, and generalizability inquiry. The “fidelity of the scoring structure to 
the structure of the construct domain,” the “extent to which scores’ relationships with other measures 
and behavior reflect domain theory,” and the “extent to which score properties and interpretations 
generalize” were examined primarily via assessment scores from 13 high school biology 
classrooms before and after genetics instruction (including three classrooms where GenScope 
was implemented). These scores were analyzed using multi-faceted Rasch scaling (Linacre, 
1989). This scaling method locates each assessment item and each individual’s pretest or 
posttest performance on one linear scale. This yields an estimate of the relative “difficulty” of 
each item and the relative level of proficiency represented by each student’s test performance, all 
using a common metric, along with data indicating the precision of the entire scale as well as 
each individual’s and each item’s fit on that scale. 
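
The idea of placing items and students on one linear scale can be illustrated with a minimal sketch. It assumes the basic dichotomous Rasch model, in which the probability of a correct response depends only on the difference (in logits) between a student's proficiency and an item's difficulty. The multi-faceted model used in this inquiry adds further terms, and the function name and numeric values below are hypothetical, chosen only for illustration.

```python
import math

def rasch_probability(proficiency, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits on the same scale. This is a simplified
    stand-in for the multi-faceted model; it is an assumption, not the
    authors' analysis code.
    """
    return 1.0 / (1.0 + math.exp(-(proficiency - difficulty)))

# Hypothetical student and item locations (logits), for illustration only.
student_proficiency = 0.5
for item_difficulty in (-1.0, 0.5, 2.0):
    p = rasch_probability(student_proficiency, item_difficulty)
    print(f"item at {item_difficulty:+.1f} logits: P(correct) = {p:.2f}")
```

When proficiency equals difficulty the probability is exactly .50, which is what gives the common metric its interpretation: an item located above a student on the scale is one that student is more likely to miss than to solve.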

While piloting the first version of the instrument, we used the item fits to flag 
potentially problematic items (e.g., items answered correctly by the less proficient students 
and/or incorrectly by the more proficient students); these items were examined and some were 
revised or removed. In the present inquiry, the scale scores for each item were used to validate 
our assumptions about domain structure, primarily by documenting whether increasingly more 
expert or more complex items were, in fact, more difficult. Similarly, the scale scores for each 
individual’s pretest or posttest were used to validate assumptions about expected group 
differences and the effects of instruction. The reliability indices associated with the entire set of 
items and with the entire set of individuals informed assumptions about generalizability. 
Additionally, inter-rater reliabilities calculated by having multiple scorers score a subset of the 
assessments were used to validate the structural assumptions inherent in the scoring key. 
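
The flagging step described above might be sketched as follows. The sketch assumes that each item has a standardized infit statistic and that items outside ± 2.0 are flagged for review; that cutoff is borrowed from the fit criterion reported in the Results and is an assumption for the pilot stage. The item labels and values are hypothetical.

```python
def flag_misfitting_items(standardized_infit, threshold=2.0):
    """Return labels of items whose standardized infit lies outside +/- threshold.

    `standardized_infit` maps an item label to its standardized fit statistic.
    The default threshold of 2.0 is an assumed cutoff, not a documented
    setting of the pilot analysis.
    """
    return [item for item, z in standardized_infit.items() if abs(z) > threshold]

# Hypothetical fit statistics for four items, for illustration only.
fits = {"item_01": 0.4, "item_02": -2.6, "item_03": 1.9, "item_04": 3.1}
flagged = flag_misfitting_items(fits)
share_within = 1 - len(flagged) / len(fits)
print(flagged, f"{share_within:.0%} of items within range")
```

Flagged items are candidates for inspection rather than automatic removal, in keeping with the procedure described above, in which some flagged items were revised and others were dropped.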

Substantive inquiry. The “theoretical rationale for response consistency” was examined 
with a variety of methods. Before completing our assessment, students in the two GenScope 
pilot classrooms completed a “very near-transfer” GenScope quiz that assessed their ability to 
solve versions of selected assessment problems created using screen captures of the GenScope 
environment and the familiar GenScope dragons. Student performance on these items was 
examined in light of the GenScope curricular activities to determine whether students were 
actually learning the underlying domain concepts while completing those activities (our initial 
observations suggested that many were not). Then, each student’s performance on GenScope 
quiz items was examined in light of that student’s performance on the corresponding NewFly 
items. It was expected that some (but not all) of the students who were able to solve a particular 
problem on the GenScope quiz would fail to solve the corresponding (“far transfer”) problem on 
the NewFly assessment. Conversely, it was expected that few students would fail to solve a problem on 
the GenScope quiz but correctly solve the corresponding problem on the NewFly assessment. 
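
One way to picture this comparison is as a two-by-two tally, for each problem, of whether each student solved the quiz version and the corresponding assessment version. The sketch below is only an illustration under that assumption; the function name, student labels, and outcomes are hypothetical and are not data from the study.

```python
from collections import Counter

def transfer_crosstab(quiz_correct, assessment_correct):
    """Count students in each (quiz, assessment) outcome cell for one problem.

    Both arguments map a student label to True (solved) or False (not solved).
    The returned Counter is keyed by (solved_quiz, solved_assessment).
    """
    cells = Counter()
    for student, solved_quiz in quiz_correct.items():
        cells[(solved_quiz, assessment_correct[student])] += 1
    return cells

# Hypothetical outcomes for five students on one matched pair of problems.
quiz = {"s1": True, "s2": True, "s3": False, "s4": True, "s5": False}
far = {"s1": True, "s2": False, "s3": False, "s4": True, "s5": False}
cells = transfer_crosstab(quiz, far)
print("quiz only:", cells[(True, False)])        # transfer not achieved
print("assessment only:", cells[(False, True)])  # possible construct-irrelevant easiness
```

Under the expectations stated above, the "quiz only" cell should contain some students, while the "assessment only" cell should be nearly empty.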

When the NewFly posttest was administered in the GenScope pilot classroom, 
substantive validity was further investigated using videotaped think-alouds of four students 
solving the assessment problems and using videotaped interviewer probes of apparent 
understanding of assessed concepts in ten additional students. In the former, the first author 
provided an explanation of thinking aloud and a short practice session (following Ericsson and 
Simon, 1982) and then prompted students to continue thinking aloud as they progressed through 
their posttests. In the latter, students were videotaped while the interviewer went over already- 
completed posttests, probing the reasoning behind each response by gently challenging students 
on correct answers and providing hints and scaffolding for incorrect answers. Both procedures 
were used to help validate the accuracy of our interpretation of scores by looking for ways that 
students who seemed to understand the targeted concept failed to solve the corresponding 
problems (i.e., “construct-irrelevant difficulty”) and for ways that students got correct answers 
without the requisite understanding (“construct-irrelevant easiness”). The latter often occurs 
when critical aspects of test materials are well known to only some examinees, leading to 
invalidly high scores for those individuals. It was particularly important for us to look for 
construct-irrelevant easiness that might have been caused by the GenScope learning 
environment, as this would invalidate comparisons of learning between GenScope and 
comparison classrooms. 

Consequential inquiry. The consequences of our assessment practice were considered 
relative to (a) the GenScope software and associated curriculum, (b) the learning environments in 
the three implementation classrooms, and (c) the students in those three classrooms. We first 
documented the changes in the curriculum and the software itself that could reasonably be 
attributed to the assessment development efforts and early assessment results. Then, in light of 
our assessment activity, we observed the learning environments, administered short surveys 
alongside the assessments, interviewed two GenScope teachers, and interviewed five students in 
the GenScope implementation classrooms. 



Results 

Reflecting our interpretive approach, our findings consist of warrants for the arguments 
we wished to make, rather than positivist "proofs" of validity. 



Content-Related Validity 

Our analysis revealed that our assessments covered only a portion of the genetics content 
in the secondary school biology standards developed by the National Research Council (NRC, 
1996). In light of the broad focus of the content standards, our assessments represented a 
narrower focus on reasoning about inheritance. There were other aspects of genetics that were 
included in the GenScope curriculum (and many more that could have been included but were 
not). However, we elected to focus more specifically on what was emerging as the core of the 
GenScope curriculum and what is often the entire scope of secondary genetics instruction. The 
two outside experts confirmed that our coverage of this aspect of the domain was very thorough, 
both in terms of the topics and the scope of reasoning around those topics. While beyond the 
scope of the present research, our efforts highlight the tension between depth and breadth in 
curriculum and assessment practice and standards. 

The general consensus of the members of the assessment team and the larger GenScope 
team, along with the implementation teachers and the outside experts, was that very few 
secondary school students ever develop the level of expertise represented by the most 
challenging problems on the assessment. This was appropriate given the potential affordances of 
the GenScope environment, our expectation that the design of the assessment would also scaffold 
student performance, and the need to capture the entire range of proficiency in our sample. 

Regarding transfer, our examination revealed the number and nature of transformations 
that GenScope students had to negotiate in order to succeed in the assessment environment. 
These included organism (e.g., GenScope dragons vs. the NewWorm organism in our 
assessment), traits (e.g., dragons’ horns vs. NewWorm’s body shape), representation (e.g., 
GenScope windows vs. paper-and-pencil representations of those windows vs. conventional 
genetics diagrams, text, etc.), genotypic configuration (the XX females in most organisms and in 
the NewFly assessment and the XY males in GenScope and the NewWorm assessment), social 
context (working with other students vs. working individually), and motivational context (the 
typically ungraded GenScope activities vs. graded assessment performance). We concluded that 
successful assessment performance represented a non-trivial transfer of understanding from the 
GenScope environment or other conceivable genetics learning environments. 

Structural Validity 

The fit indices and scale scores derived from the Rasch scaling were examined to validate 
our assumptions about the development of expertise in the domain that we attempted to represent 
within and across the different types of problems. Item fit indices show how well the relative 
difficulty of the various items was explained by the Rasch model. Based on a standard normal 
curve, we would expect 95% of the items to fall within ± 2.0 SD; the number of items in excess 
of 5% outside of this range indicates the presence of variance that is not explained by the Rasch 
model. On the NewFly, 40 of 56 items (71%) had standardized infit MSE within ± 2.0 SD (and 
52 of 56 within ± 3.0 SD). Reflecting the fact that it was in essence a further refinement of the 
NewFly assessment, the fits for the NewWorm were better: 50 of the 60 items (83%) had 
standardized infit MSE within ± 2.0 SD (and 57 of 60 items within ± 3.0 SD). The Rasch 
modeling also confirmed that our assessments captured a broad range of proficiency. The 
separation index (a measure of the spread of the estimates relative to their precision) was over 
5.0 for both instruments. Loosely interpreted, this means that the precision of our assessments 
allows us to differentiate between five statistically significant intervals of proficiency in these 
populations. This is supported by the fact that the reliability of the separation index for the items 
was .96 for NewWorm and .84 for NewFly, confirming that we had a wide range of item 
difficulty. Similarly, the Rasch model revealed high reliabilities for students (.79 for NewWorm; 
.87 for NewFly), indicating that these items were able to distinguish between students. 
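
The link between a separation index and its reliability can be made concrete with a short sketch. It assumes the conventional Rasch relationship R = G² / (1 + G²), where G is the separation index and R the separation reliability; the paper does not state the formula, so this convention and the function names are assumptions.

```python
import math

def reliability_from_separation(separation):
    """Separation reliability R implied by a separation index G (assumed convention)."""
    return separation ** 2 / (1.0 + separation ** 2)

def separation_from_reliability(reliability):
    """Separation index G implied by a separation reliability R (inverse of the above)."""
    return math.sqrt(reliability / (1.0 - reliability))

# A separation index of 5.0 corresponds to a reliability of 25/26.
print(f"{reliability_from_separation(5.0):.2f}")
```

Under this convention a separation of 5.0 corresponds to a reliability of about .96. Because reliability is bounded at 1.0, the separation index is the more sensitive of the two at the high end of the range.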

Of primary interest in our analysis was the structure of the construct as revealed by the 
relative difficulties of the items within the assessments, in light of our assumptions about the 
development of domain expertise. As described earlier (and shown in Table 1), we started with 
strong assumptions about the relative difficulty of the various items. Figure 2 and Figure 3 show 
how the mean difficulties of the various clusters of NewWorm and NewFly items validated those 
assumptions. First, we note that the item structure was generally replicated across the two 
instruments. Across aspects of inheritance (i.e., from left to right), effect-to-cause reasoning was 
more difficult than cause-to-effect, and between-generation was more difficult than within- 
generation.¹ Across reasoning types (bottom to top), items involving complex aspects of 
inheritance (i.e., X-linkage) were more difficult than items involving simpler aspects.² 
Additionally, dihybrid inheritance items involving linked alleles (thus requiring understanding of 
meiotic events to solve) were much more difficult than items that did not include genetic linkage. 

Additional results not shown in Figures 2 and 3 further confirm our assumptions about the 
relative item difficulties. For the between-generation cause-to-effect problems (i.e., traditional 
Mendelian inheritance problems), items requiring probabilistic (e.g., 1/4, 1/2, 3/4) reasoning were 
more difficult than items requiring categorical (yes, maybe, no) reasoning (+1.57 vs. -1.11 logits 
for NewWorm, +1.15 vs. -.52 for NewFly). The item indices also confirmed that the items 
involving alleles for which we provided explicit genotype-phenotype relationships on the 
NewWorm (in order to scaffold very basic understanding) were easier than items that required 
the student to infer the genotype-phenotype relationship (-2.42 vs. -1.47 logits for the cause-to- 
effect within-generation problems). 

Substantive Validity 

Regarding our interpretation of assessment scores as evidence of understanding, our 
examination of students’ answers on the NewFly assessment relative to their answers on the 
GenScope quiz revealed only a handful of cases where students failed to solve one of the “very 
near transfer” problems on the GenScope quiz yet provided a correct answer on the 
corresponding NewFly problem. Conversely, on each of the GenScope quiz items, only a subset 
of the students who solved a given problem went on to solve the corresponding problem on the 
NewFly assessment. This indicates both that our assessment minimized variance due to factors 
that we considered irrelevant to domain reasoning and that the assessment problems did require a 
reasonable transfer of understanding from the GenScope learning environment. Our 
observations of two classrooms where GenScope was implemented and examination of the 
existing curricular activities further validated our assumptions about the degree of transfer 
represented by the various assessment items. We determined that the few specific assessment 
items that might have been corrupted by particular kinds of instruction were, in fact, not 
corrupted.³ 

1 Inadvertently, the within-generation cause-to-effect items were not included in this version of the assessment. 

2 An exception on both instruments was for the within-generation effect-to-cause problems, where problems 
involving X-linked genes were less difficult than problems involving autosomes. However, these are very simple 
problems that can be solved with little or no domain knowledge, essentially by identifying the appropriate 
phenotype (the expression of the trait, such as flat vs. round body) for a given allelic combination (e.g., BB, Bb, or 
bb). In retrospect, it is not surprising that the difficulties for such items do not fully reflect our assumptions about 
domain reasoning. 

The interviews and think-alouds generally revealed that students solved the NewFly 
problems the way we expected (except for one problem that was subsequently eliminated). 
With the exception of the simplest problems at the beginning of the assessment, there was 
little evidence of students “guessing” the correct answer. However, there were several 
examples of students using the various cues included in the items (and in some cases using their 
answers to previous items) to figure out the correct answer to the more difficult problems. Given 
that these individuals could not be said to have initially “known” the answer, this might be 
characterized as “guessing” from a conventional assessment perspective; given that we designed 
the assessment to scaffold student problem solving, we view such instances as further validation 
of our assumptions about domain reasoning (e.g., that experts rely extensively on precisely such 
scaffolding to solve domain problems) and initial evidence of positive consequential validity. 
This illustrates the paradox identified by Wiggins (1993) whereby the complexity of the 
assessment context is made manageable by the clues in that same context. We also found that 
even with extensive interviewer probing and scaffolding, students were generally not able to 
provide a correct answer (or demonstrate the target understanding) for items that were initially 
answered incorrectly. Thus, we concluded that the complex context of the assessment 
successfully scaffolded domain reasoning without introducing construct-irrelevant easiness. 
Given that construct-irrelevant variance has been identified as the major threat to the validity of 
this type of complex performance assessment (Messick, 1995), these are key findings in our 
inquiry. 

External Validity 

These results concern the correspondence of students’ assessment scores with other 

3 We were particularly concerned with the single-generation pedigree problems that ask whether a particular 
parental-offspring triad represented a dominant, recessive, or indeterminate mode of inheritance, and with the 
dihybrid inheritance problems involving linked alleles. Particular instructional treatment might have reduced these 